Introduction

Session overview

In this session you will calculate probabilities, quantiles and confidence intervals using the normal distribution.

Learning Outcomes

By actively following the materials and carrying out the independent study before and after the contact hours, the successful student will be able to:

  • Explain the properties of ‘normal distributions’ (MLO 1 and 2)
  • Define the sampling distribution of the mean and the standard error (MLO 1 and 4)
  • Explain what a confidence interval is (MLO 1 and 4)
  • Calculate probabilities and quantiles in R (MLO 3 and 4)
  • Calculate confidence intervals for large and small samples in R (MLO 3 and 4)

Philosophy

Workshops are not a test. It is expected that you often don’t know how to start, make a lot of mistakes and need help. Do not be put off and don’t let what you cannot do interfere with what you can do. You will benefit from collaborating with others and/or discussing your results. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. You may wish to refer back to the independent study materials.

Materials are indexed here: https://3mmarand.github.io/BIO00017C-Data-Analysis-in-R-2020/

Key

These four symbols are used at the beginning of each instruction so you know where to carry out the instruction.

W is something you need to do on your computer. It may be opening programs or documents or locating a file.

R is something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.

GC is something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.

Q is a question for you to think about. You will usually want to record your answers in your script for future reference.

Artwork by Allison Horst

Getting started

W Start RStudio from the Start menu.

R Make an RStudio project for this workshop by clicking on the drop-down menu at the top right where it says Project: (None) and choosing New Project and then New Directory. Navigate to the “data-analysis-in-r” folder. Name the RStudio Project ‘workshop3’.

R Make a new script then save it with a name like analysis.R to carry out the rest of the work.

R Load the tidyverse:

library(tidyverse)

Exercises

Distributions: the R functions

For any distribution, two very useful quantities can be calculated:
- the Distribution Function, which gives the probability that a variable takes a particular value or less.
- the Quantile function which is the inverse of the Distribution function, i.e., it returns the value (‘quantile’) for a given probability.

The functions are named with a letter p or q preceding the distribution name. Below are some examples:

| Distribution | Probability | Quantile |
|---|---|---|
| Binomial distribution | pbinom() | qbinom() |
| Normal distribution | pnorm() | qnorm() |
| t distribution | pt() | qt() |
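
As a quick check of the inverse relationship, the sketch below passes a value to a p function and then passes the resulting probability back to the matching q function. The values 1.5 and 0.9 and the 10 degrees of freedom are arbitrary choices for illustration only.

```r
# normal distribution: probability of a value of 1.5 or less
# (pnorm() and qnorm() default to mean = 0 and sd = 1)
p <- pnorm(1.5)

# the quantile function takes that probability back to the original value
x <- qnorm(p)
all.equal(x, 1.5)

# the same pattern for a t distribution with 10 degrees of freedom:
# start from a probability, find the quantile, then recover the probability
t_val <- qt(0.9, df = 10)
p_t <- pt(t_val, df = 10)
all.equal(p_t, 0.9)
```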

Calculating Probabilities for a single value

Using pnorm()

Look up pnorm() in the manual using ?pnorm

You give it a value for which you want a probability and by default it gives you the probability of getting that value or less from a normal distribution with a mean of 0 and a standard deviation of 1. If you want the probability of a value from a different normal distribution you need to set the mean and standard deviation appropriately.
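
A minimal sketch of the difference between the defaults and a distribution you specify yourself. The mean of 50, standard deviation of 10 and value of 60 are made up for illustration.

```r
# default: standard normal, mean = 0 and sd = 1
# half of the distribution lies at or below the mean
p_default <- pnorm(0)
p_default

# a different normal distribution: set mean and sd explicitly
p_a <- pnorm(60, mean = 50, sd = 10)

# 60 is one standard deviation above 50, so this matches the
# default distribution evaluated at 1
p_z <- pnorm((60 - 50) / 10)
all.equal(p_a, p_z)
```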

For example, I.Q. in the U.K. population is normally distributed with a mean of 100 and a standard deviation of 15. We can use pnorm() to calculate probabilities associated with having a particular range of IQs.

We can use the values of mean = 100 and standard deviation = 15 in pnorm() to work out the probability of having an I.Q. of 115 or less.

R First, create variables for the parameter values - this is considered good practice.

# create variables for the parameter values
m <- 100
sd <- 15

R Now pass those variables to the pnorm() function along with the value for which we want a probability:

# Now use pnorm()
pnorm(115, m, sd)
## [1] 0.8413447

R Look at the manual page. Because the default is lower.tail = TRUE, we get the probability we want, P[IQ ≤ 115].
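
To see what the lower.tail argument does, the sketch below asks for both tails at an illustrative IQ of 130 (a value chosen only for this example). The names p_lower and p_upper are not part of the workshop script.

```r
m <- 100
sd <- 15

# lower tail (the default): probability of an IQ of 130 or less
p_lower <- pnorm(130, m, sd)

# upper tail: probability of an IQ above 130
p_upper <- pnorm(130, m, sd, lower.tail = FALSE)

# together the two tails account for the whole distribution
p_lower + p_upper
```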

I recommend sketching the distribution and shading the area you want. This helps you work out which arguments to give the function.
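
A pencil sketch is enough, but if you would like R to draw one, this is one possible approach using ggplot2 (loaded as part of the tidyverse). The cut-off of 90 and the names curve_df, shaded and cutoff are assumptions made for this illustration.

```r
library(ggplot2)

m <- 100
sd <- 15

# IQ values covering four standard deviations either side of the mean,
# with the height of the normal curve at each value
curve_df <- data.frame(iq = seq(m - 4 * sd, m + 4 * sd, length.out = 400))
curve_df$density <- dnorm(curve_df$iq, mean = m, sd = sd)

# the part of the curve at or below an illustrative cut-off
cutoff <- 90
shaded <- subset(curve_df, iq <= cutoff)

# draw the curve and shade the lower tail
ggplot(curve_df, aes(x = iq, y = density)) +
  geom_line() +
  geom_area(data = shaded, fill = "grey70") +
  labs(x = "IQ", y = "Density")
```

The shaded region is the area that pnorm(90, m, sd) returns with the default lower.tail = TRUE.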

R Determine the probability of having an IQ of 115 OR MORE. Do a sketch first.

R Determine the probability of having an IQ between 85 and 115. Do a sketch first.

Q Is this what you expect?

R What is 1.96 * the standard deviation?

R What is the probability of having an IQ between 1.96 standard deviations below the mean and 1.96 standard deviations above it? Is this what you expect?

Using qnorm()

We can use qnorm() to find the IQ associated with a particular probability.

We will again use the values of mean = 100 and standard deviation = 15 in qnorm() to work out what I.Q. value 0.2 (20%) of people fall below. Make sure you relate the manual information to the command.

R To find the I.Q. value that 20% of people fall below:

qnorm(0.2, m, sd)
## [1] 87.37568

20% of people have an IQ less than 87.4.

R What I.Q. value are 0.025 (2.5%) of people below?

R In what range do 99% of the population fall? Note that 99% in the middle leaves 1% (0.01) across the two tails combined, so 0.5% (0.005) in each tail. The figure may help you.
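
The same tail reasoning can be sketched for a different range. The example below uses the central 90% (chosen only for illustration), which leaves 5% in each tail. The names tails and limits are assumptions for this example.

```r
m <- 100
sd <- 15

# central 90%: 10% in the two tails combined, so 5% (0.05) in each tail
tails <- c(0.05, 0.95)

# qnorm() works elementwise, so one call returns both limits
limits <- qnorm(tails, m, sd)
limits

# check: the probability between the two limits is 0.9
pnorm(limits[2], m, sd) - pnorm(limits[1], m, sd)
```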

Calculating Probabilities for samples

The only difference in using pnorm() and qnorm() for samples is in what we give as the sd argument. Since we are now thinking about the distribution of the sample means, we need to use the standard error.

We used mean = 100 and standard deviation = 15 in pnorm() to work out the probability of an individual having an I.Q. of 115 or less.

We can use a similar approach to find the probability of getting a sample of n = 5 having a mean I.Q. of 115 or less. The only difference is that we use the standard error instead of the standard deviation.

R First, calculate the standard error:

n <- 5
se <- 15 / sqrt(n)

R Now the probability of getting a sample mean of 115 or less from that distribution:

pnorm(115, m, se)
## [1] 0.9873263

There’s a 0.9873 probability that a sample of 5 people will have a mean of 115 or less. Thus there is a probability of just 0.0127 that a sample of n = 5 will have a mean above 115. This is quite unlikely and we might suspect this group was not sampled from the general population.
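
If you want to convince yourself that sample means really are distributed with the standard error as their standard deviation, a small simulation can help. This is only a sketch: the seed, the 10,000 repeats and the names sd_pop and sample_means are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed so the simulation is repeatable

m <- 100
sd_pop <- 15              # population standard deviation
n <- 5
se <- sd_pop / sqrt(n)    # theoretical standard error

# draw 10,000 samples of size n and record the mean of each
sample_means <- replicate(10000, mean(rnorm(n, m, sd_pop)))

# the standard deviation of the sample means should be close to se
sd(sample_means)
se

# the proportion of sample means at or below 115 should be close
# to the probability from pnorm()
mean(sample_means <= 115)
pnorm(115, m, se)
```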

R What is the probability of a sample of size 10 having a mean of 105 or more?

Confidence intervals (large samples)

The data in beewing.txt are left wing widths of 100 honey bees (mm). The confidence interval for large samples is given by: \(\bar{x} \pm 1.96 \times s.e.\)

Where 1.96 is the quantile for 95% confidence.

You may need to refer to previous practicals to remind yourself how to carry out some of the following steps.

W Save a copy of the file. I saved mine to my ‘data’ directory.

R Read in the data and check the structure of the resulting dataframe

R Rename the column to ‘wing’

R Calculate the mean, standard deviation and standard error and assign them to variables. The code below assumes the mean is in m and the standard error is in se.

R To calculate the 95% confidence interval we need to look up the quantile (multiplier) using qnorm()

q <- qnorm(0.975)

R Now we can use it in our confidence interval calculation

lcl <- m - q * se
ucl <- m + q * se
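
The whole large-sample calculation can be sketched from start to finish on simulated values. The numbers below are generated with rnorm() purely as a stand-in and are NOT the beewing.txt data; the mean of 4.5 and standard deviation of 0.5 used to generate them, and the names x and s, are arbitrary assumptions.

```r
set.seed(42)  # arbitrary seed so the example is repeatable

# simulated stand-in for 100 measurements; not the real bee wing data
x <- rnorm(100, mean = 4.5, sd = 0.5)

# sample mean, sample standard deviation and standard error
m <- mean(x)
s <- sd(x)
se <- s / sqrt(length(x))

# quantile for 95% confidence: 2.5% in each tail
q <- qnorm(0.975)

# lower and upper confidence limits
lcl <- m - q * se
ucl <- m + q * se
c(lower = lcl, mean = m, upper = ucl)
```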

R Between what values would you be 99% confident that the population mean lies?

Confidence intervals (small samples)

The confidence interval for small samples is given by: \(\bar{x} \pm t_{[d.f.]} \times s.e.\)

The fatty acid docosahexaenoic acid (DHA) is a major component of membrane phospholipids in nerve cells, and deficiency leads to many behavioural and functional deficits. The cross-sectional area of neurons in the CA1 region of the hippocampus of normal rats is 155 \(\mu m^2\). A DHA-deficient diet was fed to 8 animals and the cross-sectional area (csa) of their neurons is given in neuron.txt.

W Save a copy of the file. I saved mine to my ‘data’ directory.

R Read in the data and check the structure of the resulting dataframe

R Assign the mean to m

R Calculate and assign the standard error to se

To work out the confidence interval for our sample mean we need to use the t distribution because it is a small sample. This means we need to determine the degrees of freedom (the number in the sample minus one).

R We can assign this to a variable using:

df <- length(neur$csa) - 1; df
## [1] 7

R The t value is found by:

t <- qt(0.975, df = df); t
## [1] 2.364624

R And the confidence interval by:

round(m + t * se, 2)
## [1] 151.95
round(m - t * se, 2)
## [1] 132.75
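
To see why the t distribution matters for small samples, it can help to compare its 0.975 quantile with the normal one. In this sketch the degrees of freedom other than 7 are arbitrary illustrative choices, and dfs and t_vals are names introduced only for this example.

```r
# degrees of freedom from small (7, as in the neuron data) to large
dfs <- c(7, 30, 100, 1000)

# qt() works elementwise, giving one multiplier per value of df
t_vals <- qt(0.975, df = dfs)
data.frame(df = dfs, t = round(t_vals, 3))

# the normal quantile used for large samples
qnorm(0.975)
```

The t multiplier is always larger than the normal one, which widens the interval, and it shrinks towards 1.96 as the degrees of freedom increase.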

Q Given the upper and lower confidence values for the estimate of the population mean, what do you think about the effect of the DHA deficient diet?

🎂 Well Done! 🎉

Artwork by @allison_horst

Independent study following the workshop

1. Calculate confidence intervals

Adiponectin is exclusively secreted from adipose tissue and modulates a number of metabolic processes. Nicotinic acid can affect adiponectin secretion. 3T3-L1 adipocytes were treated with nicotinic acid or with a control treatment and adiponectin concentration (pg/mL) was measured. The data are in adipocytes.txt. Each row represents an independent sample of adipocytes; the first column gives the concentration of adiponectin and the second column indicates whether the sample was treated with nicotinic acid or not.

Estimate the mean adiponectin concentration in each group: calculate the sample mean and construct a confidence interval around it for each group. Bear in mind that R works ‘elementwise’, which means you can use vectors of two values in a formula to get a vector of two results, in the same way that you would use a single value. See workshop 1.
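
A minimal sketch of what elementwise calculation looks like for two groups at once. All the numbers here are made up for illustration and are not taken from adipocytes.txt; the names means, ses, dfs, lower and upper are also just assumptions.

```r
# made-up summary values for two groups, for illustration only
means <- c(5.2, 7.9)   # sample mean of each group
ses   <- c(0.4, 0.6)   # standard error of each group
dfs   <- c(9, 14)      # degrees of freedom (n - 1) of each group

# one t multiplier per group
t_vals <- qt(0.975, df = dfs)

# each line works on both groups at once
lower <- means - t_vals * ses
upper <- means + t_vals * ses
data.frame(group = c("A", "B"), lower, mean = means, upper)
```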

The Code files

These contain all the code needed in the workshop even where it is not visible on the webpage.

Rmd file: The Rmd file is the file I use to compile the practical. Rmd stands for R markdown. It allows R code and ordinary text to be interwoven to produce well-formatted reports, including webpages. If you right-click on the link and choose Save-As, you will be able to open the Rmd file in RStudio. Alternatively, View in Browser.

Plain script file: This is a plain script (.R) version of the practical generated from the Rmd. Again, you can save this and open it in RStudio. Alternatively, View in Browser.

Pages made with rmarkdown (Allaire, Xie, et al., 2019a; Xie, Allaire, et al., 2018a), kableExtra (Zhu, 2019a), RefManager (McLean, 2014).

References

Allaire, J., Y. Xie, et al. (2019a). rmarkdown: Dynamic Documents for R. R package version 1.16. URL: https://github.com/rstudio/rmarkdown.

McLean, M. W. (2014). Straightforward Bibliography Management in R Using the RefManager Package. arXiv: 1403.2036 [cs.DL]. URL: https://arxiv.org/abs/1403.2036.

Xie, Y., J. Allaire, et al. (2018a). R Markdown: The Definitive Guide. ISBN 9781138359338. Boca Raton, Florida: Chapman and Hall/CRC. URL: https://bookdown.org/yihui/rmarkdown.

Zhu, H. (2019a). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0. URL: https://CRAN.R-project.org/package=kableExtra.